-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add voiceactivity action. #333
Conversation
This change adds support for the voice activity detection (VAD) feature for microphones. It allows application to show a notification when user is speaking but MediaStreamTrack is muted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments below, looks good overall to me.
We might want to add some text related to toggleMicrophone/Camera/voiceactivity source's target so that these actions are called on MediaSession objects whose context has capture MediaStreamTracks. This should be done in a separate PR though.
index.bs
Outdated
<p> | ||
A user agent MUST invoke {{MediaSessionActionHandler}} for | ||
{{MediaSessionAction/voiceactivity}} only when the voice activity is | ||
detected from a source with one or more live {{MediaStreamTrack}}s. A user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to state that this is restricted to microphone tracks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed "source" to "microphone".
@@ -1496,6 +1513,7 @@ parameter whose dictionary type is: | |||
<li>{{MediaSessionActionDetails}} for {{MediaSessionAction/nextslide}}.</li> | |||
<li>{{MediaSessionActionDetails}} for | |||
{{MediaSessionAction/enterpictureinpicture}}.</li> | |||
<li>{{MediaSessionActionDetails}} for {{MediaSessionAction/voiceactivity}}.</li> | |||
</ul> | |||
|
|||
The <dfn dict-member for="MediaSessionActionDetails">action</dfn> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could an example that links voice activity with displaying a UI that can execute setMicrophoneActive(true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I also prefer a separate PR for toggleMicrophone/Camera, so this PR doesn't change anything other than voiceactivity action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments inline. My bigger question is why expose this via Media Session and not on the media stream track? Also need to think about the privacy implications of allowing pages to detect voice activity on muted tracks.
index.bs
Outdated
the action's intent is to notify the action handler that a voice | ||
activity is started. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the following be a bit clearer?
the action's intent is to notify the action handler that a voice | |
activity is started. | |
the action's intent is to notify the web page that voice | |
activity has been detected by the microphone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note to make it clearer.
index.bs
Outdated
@@ -541,6 +546,17 @@ platform UI or media keys, thereby improving the user experience. | |||
{{MediaSessionActionHandler}} before running, as different tasks, the | |||
steps defined to [$set a track's muted state$]. | |||
</p> | |||
<p> | |||
A user agent MUST invoke {{MediaSessionActionHandler}} for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user agent MUST invoke {{MediaSessionActionHandler}} for | |
A user agent MUST invoke the {{MediaSessionActionHandler}} for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
@@ -541,6 +546,17 @@ platform UI or media keys, thereby improving the user experience. | |||
{{MediaSessionActionHandler}} before running, as different tasks, the | |||
steps defined to [$set a track's muted state$]. | |||
</p> | |||
<p> | |||
A user agent MUST invoke {{MediaSessionActionHandler}} for | |||
{{MediaSessionAction/voiceactivity}} only when the voice activity is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{{MediaSessionAction/voiceactivity}} only when the voice activity is | |
{{MediaSessionAction/voiceactivity}} only when voice activity is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
user agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all | ||
{{MediaStreamTrack}}s associated with the source are not | ||
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a | ||
minimal interval for invoking {{MediaSessionActionHandler}} for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest clarifying "minimal interval" - is this a time delay after voice activity is detected before the action handler is invoked? And how long does voice activity need to be present for? And does voice activity that comes and goes cause multiple invocations? (Not suggesting we spec these things, just clarify what we're recommending user agents to consider)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest clarifying "minimal interval" - is this a time delay after voice activity is detected before the action handler is invoked?
It's intended to be a minimal interval between two voiceactivity
actions (or events? event sounds to be more accurate here).
And how long does voice activity need to be present for? And does voice activity that comes and goes cause multiple invocations?
It actually depends on the voice activity detection (VAD) algorithm. Sometimes VAD algorithm may even consider background noise as a voice activity. Based on the use case (unmute microphone notification) we want to target, it is recommended not invoking this action handler too frequently.
A new note section is added for some background about this action.
Good question. There is some desire to expose whether there is voice info in live microphone capture.
Again, good question. If web applications are not able to help users unmute themselves, web applications will not ask to be muted. FWIW, in Safari, there is a capture indicator telling whether capture is running (red icon), capture is muted (bar gray icon) or capture is not running (no icon). This new action would only happen in the second case, not in the third one (much more problematic). |
Thanks for putting that to the agenda. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Chris, thanks for your review.
Also thanks to Youenn for providing detailed info about this action, and OS support.
index.bs
Outdated
@@ -541,6 +546,17 @@ platform UI or media keys, thereby improving the user experience. | |||
{{MediaSessionActionHandler}} before running, as different tasks, the | |||
steps defined to [$set a track's muted state$]. | |||
</p> | |||
<p> | |||
A user agent MUST invoke {{MediaSessionActionHandler}} for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
@@ -541,6 +546,17 @@ platform UI or media keys, thereby improving the user experience. | |||
{{MediaSessionActionHandler}} before running, as different tasks, the | |||
steps defined to [$set a track's muted state$]. | |||
</p> | |||
<p> | |||
A user agent MUST invoke {{MediaSessionActionHandler}} for | |||
{{MediaSessionAction/voiceactivity}} only when the voice activity is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
user agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all | ||
{{MediaStreamTrack}}s associated with the source are not | ||
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a | ||
minimal interval for invoking {{MediaSessionActionHandler}} for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest clarifying "minimal interval" - is this a time delay after voice activity is detected before the action handler is invoked?
It's intended to be a minimal interval between two voiceactivity
actions (or events? event sounds to be more accurate here).
And how long does voice activity need to be present for? And does voice activity that comes and goes cause multiple invocations?
It actually depends on the voice activity detection (VAD) algorithm. Sometimes VAD algorithm may even consider background noise as a voice activity. Based on the use case (unmute microphone notification) we want to target, it is recommended not invoking this action handler too frequently.
A new note section is added for some background about this action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more editorial suggestions, but overall this looks good. Thank you!
index.bs
Outdated
@@ -412,6 +412,11 @@ platform UI or media keys, thereby improving the user experience. | |||
the action's intent is to open the media session in a | |||
picture-in-picture window. | |||
</li> | |||
<li> | |||
<dfn enum-value for=MediaSessionAction>voiceactivity</dfn>: | |||
the action's intent is to notify the web page that a voice activity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the action's intent is to notify the web page that a voice activity | |
the action's intent is to notify the web page that voice activity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all | ||
{{MediaStreamTrack}}s associated with the source are not | ||
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a | ||
minimal interval for invoking {{MediaSessionActionHandler}} for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minimal interval for invoking {{MediaSessionActionHandler}} for | |
minimal interval between invocations of the {{MediaSessionActionHandler}} for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
</p> | ||
|
||
<p class=note> | ||
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice | |
{{MediaSessionAction/voiceactivity}} only indicates the start of voice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
|
||
<p class=note> | ||
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice | ||
activity. Application may display a notification if the user is speaking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
activity. Application may display a notification if the user is speaking | |
activity. Applications may display a notification if the user is speaking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice | ||
activity. Application may display a notification if the user is speaking | ||
while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for | ||
audio processing. No action is defined for the end of a voice activity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
audio processing. No action is defined for the end of a voice activity. | |
audio processing. No action is defined for the end of voice activity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
activity. Application may display a notification if the user is speaking | ||
while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for | ||
audio processing. No action is defined for the end of a voice activity. | ||
Unlike other actions which are explicitely triggered by the user, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike other actions which are explicitely triggered by the user, | |
Unlike other actions which are explicitly triggered by the user, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
efficiency concern, web page may not be notified if the second voice | ||
activity started soon after last {{MediaSessionAction/voiceactivity}} | ||
action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
efficiency concern, web page may not be notified if the second voice | |
activity started soon after last {{MediaSessionAction/voiceactivity}} | |
action. | |
efficiency concerns, the web page may not be notified if voice | |
activity ends and restarts soon after the last {{MediaSessionAction/voiceactivity}} | |
action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
MediaStreamTrack.muted is a readonly attribute. Replace it with enabled, which can be set by the application.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments.
index.bs
Outdated
@@ -412,6 +412,11 @@ platform UI or media keys, thereby improving the user experience. | |||
the action's intent is to open the media session in a | |||
picture-in-picture window. | |||
</li> | |||
<li> | |||
<dfn enum-value for=MediaSessionAction>voiceactivity</dfn>: | |||
the action's intent is to notify the web page that a voice activity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if all | ||
{{MediaStreamTrack}}s associated with the source are not | ||
{{MediaStreamTrack/muted}}. It is RECOMMENDED for user agents to set a | ||
minimal interval for invoking {{MediaSessionActionHandler}} for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
</p> | ||
|
||
<p class=note> | ||
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
|
||
<p class=note> | ||
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice | ||
activity. Application may display a notification if the user is speaking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
{{MediaSessionAction/voiceactivity}} only indicates the start of a voice | ||
activity. Application may display a notification if the user is speaking | ||
while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for | ||
audio processing. No action is defined for the end of a voice activity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
activity. Application may display a notification if the user is speaking | ||
while the {{MediaStreamTrack}} is muted, or start an {{AudioWorklet}} for | ||
audio processing. No action is defined for the end of a voice activity. | ||
Unlike other actions which are explicitely triggered by the user, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
efficiency concern, web page may not be notified if the second voice | ||
activity started soon after last {{MediaSessionAction/voiceactivity}} | ||
action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
from a microphone with one or more live {{MediaStreamTrack}}s. A user | ||
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if | ||
microphone is not muted and all {{MediaStreamTrack}}s associated with the | ||
source are {{MediaStreamTrack/enabled}}. It is RECOMMENDED for user agents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced MediaStreamTrack.muted
with MediaStreamTrack.enabled
because muted is a readonly property. Application may want to set enabled
to true
if voice activity is detected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me.
I think we can add some wording stating the voiceactivity action source MUST always have a target whose document MUST always have live microphone {{MediaStreamTrack}}s. This will more closely reuse spec wording and tighten the fact that voiceacitvity cannot happen on a document that is not capturing microphone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, a few more minor comments.
index.bs
Outdated
{{MediaSessionAction/voiceactivity}} only when voice activity is detected | ||
from a microphone with one or more live {{MediaStreamTrack}}s. A user | ||
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if | ||
microphone is not muted and all {{MediaStreamTrack}}s associated with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
microphone is not muted and all {{MediaStreamTrack}}s associated with the | |
the microphone is not muted and all {{MediaStreamTrack}}s associated with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
<p> | ||
A user agent MUST invoke the {{MediaSessionActionHandler}} for | ||
{{MediaSessionAction/voiceactivity}} only when voice activity is detected | ||
from a microphone with one or more live {{MediaStreamTrack}}s. A user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If "live" here is the same as {{MediaStreamTrackState/live}}, we could link them:
from a microphone with one or more live {{MediaStreamTrack}}s. A user | |
from a microphone with one or more {{MediaStreamTrackState/live}} {{MediaStreamTrack}}s. A user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
A user agent MUST invoke the {{MediaSessionActionHandler}} for | ||
{{MediaSessionAction/voiceactivity}} only when voice activity is detected | ||
from a microphone with one or more live {{MediaStreamTrack}}s. A user | ||
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Ignore" seems strange here and implies the action is generated somehow, but ignored. Could we say "A user agent MAY ignore voice activity if the microphone is not muted ..."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. It sounds better.
Thanks for the suggestion. It's added to the beginning of the paragraph. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you!
Thanks for your review. Please help merge it if it's okay to merge.
index.bs
Outdated
A user agent MUST invoke the {{MediaSessionActionHandler}} for | ||
{{MediaSessionAction/voiceactivity}} only when voice activity is detected | ||
from a microphone with one or more live {{MediaStreamTrack}}s. A user | ||
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. It sounds better.
index.bs
Outdated
{{MediaSessionAction/voiceactivity}} only when voice activity is detected | ||
from a microphone with one or more live {{MediaStreamTrack}}s. A user | ||
agent MAY ignore a {{MediaSessionAction/voiceactivity}} action if | ||
microphone is not muted and all {{MediaStreamTrack}}s associated with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
index.bs
Outdated
<p> | ||
A user agent MUST invoke the {{MediaSessionActionHandler}} for | ||
{{MediaSessionAction/voiceactivity}} only when voice activity is detected | ||
from a microphone with one or more live {{MediaStreamTrack}}s. A user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Thanks! As they're the editors, I leave @youennf or @steimelchrome to merge. |
Merging given comments have been addressed. |
SHA: 0f6e693 Reason: push, by youennf Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This change adds support for the voice activity detection (VAD) feature for microphones. It allows applications to show a notification when the user is speaking while the MediaStreamTrack is muted.
Related discussion in mediacapture-extensions spec: issue, pr, slide
Preview | Diff